Learning String Edit Distance
نویسندگان
چکیده
In many applications, it is necessary to determine the similarity of two strings. A widely-used notion of string similarity is the edit distance: the minimum number of insertions, deletions, and substitutions required to transform one string into the other. In this report, we provide a stochastic model for string edit distance. Our stochastic model allows us to learn the optimal string edit distance function from a corpus of examples. We illustrate the utility of our approach by applying it to the di cult problem of learning the pronunciation of words in conversational speech. In this application, we learn a string edit distance function with one third the error rate of the untrained Levenshtein distance.
منابع مشابه
Learning String Edit Distance 1
In many applications, it is necessary to determine the similarity of two strings. A widely-used notion of string similarity is the edit distance: the minimum number of insertions, deletions, and substitutions required to transform one string into the other. In this report, we provide a stochastic model for string edit distance. Our stochastic model allows us to learn a string edit distance func...
متن کاملMAUL: Machine Agent User Learning∗
We describe implementation of a classifier for User-Agent strings using Support Vector Machines. The best kernel is found to be the linear kernel, even when more complicated string based kernels, such as the edit distance kernel and the subsequence kernel, are employed. A robust tokenization scheme is employed which dramatically speeds up the calculation for the edit string and subsequence kern...
متن کاملLearning Balls of Strings from Edit Corrections
When facing the question of learning languages in realistic settings, one has to tackle several problems that do not admit simple solutions. On the one hand, languages are usually defined by complex grammatical mechanisms for which the learning results are predominantly negative, as the few algorithms are not really able to cope with noise. On the other hand, the learning settings themselves re...
متن کاملLearning state machine-based string edit kernels
During the past few years, several works have been done to derive string kernels from probability distributions. For instance, the Fisher kernel uses a generative model M (e.g. a hidden markov model) and compares two strings according to how they are generated by M . On the other hand, the marginalized kernels allow the computation of the joint similarity between two instances by summing condit...
متن کاملDetecting English-French Cognates Using Orthographic Edit Distance
Identification of cognates is an important component of computer assisted second language learning systems. We present a simple rule-based system to recognize cognates in English text from the perspective of the French language. At the core of our system is a novel similarity measure, orthographic edit distance, which incorporates orthographic information into string edit distance to compute th...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- IEEE Trans. Pattern Anal. Mach. Intell.
دوره 20 شماره
صفحات -
تاریخ انتشار 1997